Data Mining - Pendahuluan Social Network Analysis

Outline SNA :¶

  • Review Graph
  • Centrality Analysis
  • Graph Partitioning
  • Community Detection
  • Discussion on Final Project

Sejarah

  • 1736: Leonhard Euler - Basel, 1707-St. Petersburg, 1786
  • Mampu mengungkap misteri Jembatan Konigsberg di Prussia (saat ini Kaliningrad, Russia)

Graph (Definisi)

    • Suatu Graph G adalah koleksi atau pasangan dari dua himpunan V dan E dengan
    • V = V(G) = himpunan verteks atau simpul atau node.
    • E = E(G) = himpunan edge atau ruas atau sisi.

Graph Types

Shortest Path and Cycles

Degree/Derajat Vertex

Bipartite Graph

Graph Representation

Graph Isomorphism

Pemodelan Graph untuk Penyelesaian Masalah

Graph Applications

Graph in Social Media Analytic

Mari mulai dengan Membuat Graph Sederhana (Toy Data)¶

1. Mendefinisikan Graph¶

2. Menambahkan Vertex dan Edge¶

3. Graph Properties¶

4. Visualisasi Graph¶

In [1]:
# 1. Mendefinisikan Graph (kosong)
import networkx as nx

G = nx.Graph()
G
Out[1]:
<networkx.classes.graph.Graph at 0x27e85f9e200>
In [2]:
V = [1, 2, 7, 9, 12, 19] # Bisa juga string, misal "A" atau nama "Budi"
E = [(1,2), (2,19), (9,2), (9,1), (2,8), (8,10), (12,7),(7,2), (7,9)] # Perhatikan "8" dan "10" TIDAK ADA di V

G.add_nodes_from(V)
G.add_edges_from(E)

print('Banyak vertex = ', G.number_of_nodes())
print('Banyak Edges = ', G.number_of_edges())
Banyak vertex =  8
Banyak Edges =  9
In [3]:
# Draw Graph # https://networkx.github.io/documentation/networkx-1.10/reference/drawing.html#layout
import matplotlib.pyplot as plt

pos = nx.spring_layout(G) # Spring LayOut

nx.draw_networkx_nodes(G,pos, alpha=0.1,node_color='blue',node_size=800) # Gambar Vertex
nx.draw_networkx_edges(G,pos,width=1) # Gambar edges
nx.draw_networkx_labels(G,pos) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image
In [4]:
# Ingat Graph Isomorphism?
pos=nx.circular_layout(G) # Circular Layout

nx.draw_networkx_nodes(G,pos) # Gambar Vertex
nx.draw_networkx_edges(G,pos,width=2) # Gambar edges
nx.draw_networkx_labels(G,pos) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image
In [5]:
pos = nx.random_layout(G) # Random

nx.draw_networkx_nodes(G,pos, alpha=0.2,node_color='blue',node_size=600) # Gambar Vertex
nx.draw_networkx_edges(G,pos,width=2) # Gambar edges
nx.draw_networkx_labels(G,pos) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image
In [6]:
shell = [[1,2,7,9],[12,19,8,10]]
pos = nx.shell_layout(G, shell) # Shells

nx.draw_networkx_nodes(G,pos, alpha=0.2,node_color='blue',node_size=600) # Gambar Vertex
nx.draw_networkx_edges(G,pos,width=2) # Gambar edges
nx.draw_networkx_labels(G,pos) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image
In [7]:
pos=nx.spectral_layout(G) # Spectral

nx.draw_networkx_nodes(G,pos, alpha=0.2,node_color='blue',node_size=600) # Gambar Vertex
nx.draw_networkx_edges(G,pos,width=2) # Gambar edges
nx.draw_networkx_labels(G,pos) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image

PyGraphVis (Demonstration)¶

In [8]:
from graphviz import Digraph

f = Digraph('finite_state_machine', filename='fsm.gv')
f.attr(rankdir='LR', size='8,5')
f.attr('node', shape='doublecircle')
f.node('LR_0'); f.node('LR_3')
f.node('LR_4');f.node('LR_8')
f.attr('node', shape='circle')
f.edge('LR_0', 'LR_2', label='SS(B)'); f.edge('LR_0', 'LR_1', label='SS(S)')
f.edge('LR_1', 'LR_3', label='S($end)'); f.edge('LR_2', 'LR_6', label='SS(b)')
f.edge('LR_2', 'LR_5', label='SS(a)'); f.edge('LR_2', 'LR_4', label='S(A)')
f.edge('LR_5', 'LR_7', label='S(b)'); f.edge('LR_5', 'LR_5', label='S(a)')
f.edge('LR_6', 'LR_6', label='S(b)'); f.edge('LR_6', 'LR_5', label='S(a)')
f.edge('LR_7', 'LR_8', label='S(b)'); f.edge('LR_7', 'LR_5', label='S(a)')
f.edge('LR_8', 'LR_6', label='S(b)'); f.edge('LR_8', 'LR_5', label='S(a)')
f
Out[8]:
No description has been provided for this image
In [9]:
# Hapus Vertex/Edges
try:
    G.remove_node(19)
except Exception as err_:
    print(err_)
try:
    G.remove_edge(1, 2)
except Exception as err_:
    print(err_)

pos=nx.spectral_layout(G) # Spectral

nx.draw_networkx_nodes(G,pos, alpha=0.1,node_color='blue',node_size=800) # Gambar Vertex
nx.draw_networkx_edges(G,pos,width=2, alpha=0.1) # Gambar edges
nx.draw_networkx_labels(G,pos, font_size=14) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image
In [10]:
# Degree
G.degree[10] 
Out[10]:
1

(Shortest) Path

In [11]:
print(nx.shortest_path(G, source=9, target=10))


pos=nx.spectral_layout(G)
nx.draw_networkx_nodes(G,pos, alpha=0.1,node_color='blue',node_size=800) # Gambar Vertex
nx.draw_networkx_edges(G,pos,width=2, alpha=0.1) # Gambar edges
nx.draw_networkx_labels(G,pos, font_size=14) #Gambar Label Nodes
plt.show() # Show the graph
[9, 2, 8, 10]
No description has been provided for this image

Graph From Social media

Mentions, Followers, Friends

In [12]:
# Loading Data 
import pandas as pd, re, operator, numpy as np
from tqdm import tqdm

file_ = 'data/sample-twitter-data.csv'

try:
    import google.colab; IN_COLAB = True
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudataanalytics/Data-Mining--Penambangan-Data--Ganjil-2024/master/{file_}
except:
    IN_COLAB = False

Tweets = pd.read_csv(file_)
Tweets.head()
Out[12]:
id_str username screen_name created_at friends_count followers_count favourited_count statuses_count full_text description text_created_at text_favourites_count text_retweets_count label
0 1.46E+18 Aqilacute10 Aqila_cute Mon Nov 01 19:00:53 +0000 2021 188 745 1671 2222 Jika kau yakin dan percaya, maka mimpimu pun a... Lulusan al-Azhar · Qari' · Pecinta Quran\n\nDo... Fri Jul 08 02:02:23 +0000 2022 5 2 1
1 1.50E+18 Kemilau16104667 Kemilau Tue Feb 22 23:11:01 +0000 2022 343 145 2 630 Pemerintah serahkan pengelolaan dua menara rus... NaN Sat Jul 16 00:07:32 +0000 2022 0 0 1
2 163586591 nasehatmuslim Baik Berkata & Bertindak Tue Jul 06 20:06:35 +0000 2010 562 662 5943 15356 Rasulullah Muhammad bersabda,\n\nTidak boleh m... Pengusaha, peneliti, dosen, wakil ketua, dai, ... Mon Jul 18 14:45:36 +0000 2022 1 0 1
3 1.43E+18 putrisrikandi_ srikandi_ Mon Sep 06 08:51:51 +0000 2021 444 376 648 7204 Pemerintah lakukan serah Terima status penggun... 🇮🇩🇮🇩🇮🇩🇮🇩🇮🇩 Sat Jul 16 01:19:24 +0000 2022 0 0 1
4 1.53E+18 arisa_fadillah Fadillah Arisa Thu Jun 09 03:39:10 +0000 2022 31 22 0 21 PT PLN (Persero) menerima kompensasi dari peme... gemuk gak kurus Fri Jul 08 01:53:34 +0000 2022 0 1 1
In [13]:
# Draw the Tweet Graph
G=nx.Graph()
for i, tweet in tqdm(Tweets.iterrows()):
    if tweet.screen_name not in G.nodes():
        G.add_node(tweet.screen_name)
    mentionS =  re.findall("@([a-zA-Z0-9]{1,15})", tweet['full_text'])
    for mention in mentionS:
        if "." not in mention: #skipping emails
            usr = mention.replace("@",'').strip()
            if usr not in G.nodes():
                G.add_node(usr)
            G.add_edge(tweet.screen_name, usr)
Nn=G.number_of_nodes();Ne=G.number_of_edges()
print('Finished. There are %d nodes and %d edges in the Graph.' %(Nn,Ne))
3000it [00:00, 11932.91it/s]
Finished. There are 3607 nodes and 2569 edges in the Graph.

In [14]:
# Latihan dengan layout Graph yang lain
fig = plt.figure()
fig.add_subplot(111)
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos, alpha=0.2, node_color='blue', node_size=10)
#nx.draw_networkx_labels(G, pos)
nx.draw_networkx_edges(G, pos, width=1)
plt.show()
No description has been provided for this image

I. Centrality Analysis

Bertujuan untuk menemukan pengguna yang paling berpengaruh dalam suatu topik pembicaraan di media sosial. Analisanya biasanya dilakukan melalui data graph dari hubungan jaringan pertemanan (follower/friend) antar pengguna atau komunikasi antar pengguna (mentions).

Centrality by Degree

Apakah interpretasinya?¶

In [15]:
# https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.degree_centrality.html
N = 10
ranking = nx.degree_centrality(G)
important_nodes = sorted(ranking.items(), key=operator.itemgetter(1))[::-1]#[0:Nimportant]
Mstd = 1 # 1 standard Deviation CI
data = np.array([n[1] for n in important_nodes])
out = len(data[abs(data - np.mean(data)) > Mstd * np.std(data)]) # outlier within m stDev interval
if out>N:
    dnodes = [n[0] for n in important_nodes[:N]]
    print('Influencial Users: {0}'.format(str(dnodes)))
else:
    dnodes = [n[0] for n in important_nodes[:out]]
    print('Influencial Users: {0}'.format(str(important_nodes[:out]))) 
Gt = G.subgraph(dnodes)
Influencial Users: ['brin', 'Gerindra', 'prabowo', 'Hasbil', 'DivHumas', 'ListyoSigitP', 'gerindra', 'Psadt', 'al', 'ekowboy2']
In [16]:
fig = plt.figure()
fig.add_subplot(111)
pos = nx.circular_layout(Gt)
nx.draw_networkx_nodes(Gt, pos, alpha=0.2, node_color='blue', node_size=600)
nx.draw_networkx_labels(Gt, pos)
nx.draw_networkx_edges(Gt, pos, width=1)
plt.show()
No description has been provided for this image

Closeness Centrality

In [17]:
# https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.closeness_centrality.html
N = 10
ranking = nx.closeness_centrality(G)
important_nodes = sorted(ranking.items(), key=operator.itemgetter(1))[::-1]#[0:Nimportant]
Mstd = 1 # 1 standard Deviation CI
data = np.array([n[1] for n in important_nodes])
out = len(data[abs(data - np.mean(data)) > Mstd * np.std(data)]) # outlier within m stDev interval
if out>N:
    dnodes = [n[0] for n in important_nodes[:N]]
    print('Influencial Users: {0}'.format(str(dnodes)))
else:
    dnodes = [n[0] for n in important_nodes[:out]]
    print('Influencial Users: {0}'.format(str(important_nodes[:out]))) 
Gt = G.subgraph(dnodes)

fig = plt.figure()
fig.add_subplot(111)
pos = nx.circular_layout(Gt)
nx.draw_networkx_nodes(Gt, pos, alpha=0.2, node_color='blue', node_size=600)
nx.draw_networkx_labels(Gt, pos)
nx.draw_networkx_edges(Gt, pos, width=1)
plt.show()
Influencial Users: ['brin', 'blegedezLU', '🐣🐔🐓 VAXXED 3rd SHOT!', 'Pak Afiq', 'WeePHa Porever', 'KaderMilitan', "Maulin Ni'am", 'Supriyadi', '#PJMenangOmdo', 'Lantjar Djaya']
C:\anaconda\envs\Teaching\lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 128035 (\N{HATCHING CHICK}) missing from font(s) DejaVu Sans.
  fig.canvas.print_figure(bytes_io, **kw)
C:\anaconda\envs\Teaching\lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 128020 (\N{CHICKEN}) missing from font(s) DejaVu Sans.
  fig.canvas.print_figure(bytes_io, **kw)
C:\anaconda\envs\Teaching\lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 128019 (\N{ROOSTER}) missing from font(s) DejaVu Sans.
  fig.canvas.print_figure(bytes_io, **kw)
No description has been provided for this image

Betweenness Centrality

In [18]:
# https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.betweenness_centrality.html
N = 10
ranking = nx.betweenness_centrality(G)
important_nodes = sorted(ranking.items(), key=operator.itemgetter(1))[::-1]#[0:Nimportant]
Mstd = 1 # 1 standard Deviation CI
data = np.array([n[1] for n in important_nodes])
out = len(data[abs(data - np.mean(data)) > Mstd * np.std(data)]) # outlier within m stDev interval
if out>N:
    dnodes = [n[0] for n in important_nodes[:N]]
    print('Influencial Users: {0}'.format(str(dnodes)))
else:
    dnodes = [n[0] for n in important_nodes[:out]]
    print('Influencial Users: {0}'.format(str(important_nodes[:out]))) 
Gt = G.subgraph(dnodes)

fig = plt.figure()
fig.add_subplot(111)
pos = nx.circular_layout(Gt)
nx.draw_networkx_nodes(Gt, pos, alpha=0.2, node_color='blue', node_size=600)
nx.draw_networkx_labels(Gt, pos)
nx.draw_networkx_edges(Gt, pos, width=1)
plt.show()
Influencial Users: ['brin', 'dara', 'msaid', 'ListyoSigitP', 'Gerindra', 'prabowo', 'Istiqomah', 'DivHumas', 'B', 'Jonidi']
No description has been provided for this image

Eigenvector Centrality¶

Digunakan juga oleh Google dalam Algoritma PageRank-nya untuk menentukan halaman web terpenting.¶

In [19]:
N = 10
phi = 1.618033988749895 # largest eigenvalue of adj matrix
ranking = nx.katz_centrality_numpy(G,1/phi)
important_nodes = sorted(ranking.items(), key=operator.itemgetter(1))[::-1]#[0:Nimportant]
Mstd = 1 # 1 standard Deviation CI
data = np.array([n[1] for n in important_nodes])
out = len(data[abs(data - np.mean(data)) > Mstd * np.std(data)]) # outlier within m stDev interval
if out>N:
    dnodes = [n[0] for n in important_nodes[:N]]
    print('Influencial Users: {0}'.format(str(dnodes)))
else:
    dnodes = [n[0] for n in important_nodes[:out]]
    print('Influencial Users: {0}'.format(str(important_nodes[:out]))) 
Gt = G.subgraph(dnodes)

fig = plt.figure()
fig.add_subplot(111)
pos = nx.circular_layout(Gt)
nx.draw_networkx_nodes(Gt, pos, alpha=0.2, node_color='blue', node_size=600)
nx.draw_networkx_labels(Gt, pos)
nx.draw_networkx_edges(Gt, pos, width=1)
plt.show()
Influencial Users: ['PresidenKopi', 'keuangannews', 'El', 'penegak kejujuran', 'Kemenparekraf', 'Komisaris Politik', 'PutraErlangga95', 'LaNyalla Academia (LNM) #hapuspt20persen', 'wishnutama', 'visiau']
No description has been provided for this image
In [20]:
# Menggunakan centrality measure untuk merubah tampilan Graph
## Contoh dengan graph generators https://networkx.github.io/documentation/networkx-1.10/reference/generators.html

#g = nx.dorogovtsev_goltsev_mendes_graph(3)
g = nx.karate_club_graph()

pos = nx.spring_layout(g) # Spring LayOut
nx.draw_networkx_nodes(g,pos, alpha=0.2,node_color='blue',node_size=600) # Gambar Vertex
nx.draw_networkx_edges(g,pos,width=2,alpha=0.1) # Gambar edges
nx.draw_networkx_labels(g,pos) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image
In [21]:
# Menggunakan Centrality measure (misal degree) untuk merubah ukuran node

K = 100 # Scale factor
d = nx.degree(g) 
d = [d[node]*K for node in g.nodes()]

pos = nx.spring_layout(g) # Spring LayOut
nx.draw_networkx_nodes(g,pos,node_size=d) # Gambar Vertex
nx.draw_networkx_edges(g,pos,width=2,alpha=0.15) # Gambar edges
nx.draw_networkx_labels(g,pos) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image
In [22]:
# Menggunakan tingkat "kepentingan" sebagai warna
ranking = nx.degree_centrality(g)
warna = list(ranking.values())
print(warna)
[0.48484848484848486, 0.2727272727272727, 0.30303030303030304, 0.18181818181818182, 0.09090909090909091, 0.12121212121212122, 0.12121212121212122, 0.12121212121212122, 0.15151515151515152, 0.06060606060606061, 0.09090909090909091, 0.030303030303030304, 0.06060606060606061, 0.15151515151515152, 0.06060606060606061, 0.06060606060606061, 0.06060606060606061, 0.06060606060606061, 0.06060606060606061, 0.09090909090909091, 0.06060606060606061, 0.06060606060606061, 0.06060606060606061, 0.15151515151515152, 0.09090909090909091, 0.09090909090909091, 0.06060606060606061, 0.12121212121212122, 0.09090909090909091, 0.12121212121212122, 0.12121212121212122, 0.18181818181818182, 0.36363636363636365, 0.5151515151515151]
In [23]:
pos = nx.spring_layout(g) # Spring LayOut
nx.draw_networkx_nodes(g,pos, node_color=warna,node_size=d) # Gambar Vertex
nx.draw_networkx_edges(g,pos,width=2,alpha=0.1) # Gambar edges
nx.draw_networkx_labels(g,pos) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image

Summary

Coba diskusikan optimal use case dari masing-masing metric ini.¶

II. Community Detection (CD)

CD dilakukan pada data jaringan media sosial untuk menemukan komunitas-komunitas dalam pertemanan atau pembicaraan di media sosial. Secara sederhana CD dapat dimengerti sebagai proses "semacam clustering" (pengelompokan) , namun atas suatu graph.

Bipartition (Bisection) Partitioning¶

  • This algorithm paritions a network into two sets by iteratively swapping pairs of nodes to reduce the edge cut between the two sets.
  • https://www.youtube.com/watch?v=MMlf66PQdN8
  • Paper: Kernighan, B. W.; Lin, Shen (1970). “An efficient heuristic procedure for partitioning graphs.” Bell Systems Technical Journal 49: 291–307. Oxford University Press 2011.
In [24]:
B = nx.algorithms.community.kernighan_lin_bisection(g)
B
Out[24]:
({8, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33},
 {0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 16, 17, 19, 21})
In [25]:
warna = []
for v in B[0]:
    warna.append(1)
for v in B[1]:
    warna.append(2)
In [26]:
pos = nx.shell_layout(g, B)
nx.draw_networkx_nodes(g,pos, node_color=warna,node_size=d) # Gambar Vertex
nx.draw_networkx_edges(g,pos,width=2,alpha=0.1) # Gambar edges
nx.draw_networkx_labels(g,pos) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image

Graph Clustering via "Modularity"¶

  • Modules biasa disebut juga groups, clusters atau communities
  • Terdapat berbagai cara dalam menghitung "Modularity" (contoh dibawah)
  • Graph with high modularity have dense connections between the nodes within "modules" but sparse connections between nodes in different modules.
  • Salah satu metodenya : Greedy Modularity Maximization (GMM)
  • GMM begins with each node in its own community and joins the pair of communities that most increases modularity until no such pair exists.
  • Clauset, A., Newman, M. E., & Moore, C. “Finding community structure in very large networks.” Physical Review E 70(6), 2004.
  • Other resources for study: https://slideplayer.com/slide/7050174/
In [27]:
# WARNING!!!... Hanya bisa jika networkX versi 2.2 ke atas 
M = nx.algorithms.community.greedy_modularity_communities(g)
print(M)
[frozenset({8, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33}), frozenset({1, 2, 3, 7, 9, 12, 13, 17, 21}), frozenset({0, 16, 19, 4, 5, 6, 10, 11})]
In [28]:
W = []
warna = 1
for module in M:
    for node in module:
        W.append(warna)
    warna = warna +1
print(W)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]
In [29]:
pos = nx.shell_layout(g, M)
nx.draw_networkx_nodes(g,pos, node_color=W,node_size=d) # Gambar Vertex
nx.draw_networkx_edges(g,pos,width=2,alpha=0.1) # Gambar edges
nx.draw_networkx_labels(g,pos) #Gambar Label Nodes
plt.show() # Show the graph
No description has been provided for this image

Discussion on your final Project¶

End of Lectures


See you in your project presentations